In depth search of the Sequence Read Archive database reveals global distribution of the emerging pathogenic fungus Scedosporium aurantiacum

Abstract
Scedosporium species are emerging opportunistic fungal pathogens that cause various infections, mainly in immunocompromised patients but also in immunocompetent individuals following traumatic injuries. Clinical manifestations range from local infections, such as subcutaneous mycetoma or bone and joint infections, to pulmonary colonization and severe disseminated diseases. They are commonly found in soil and other environmental sources. To date, S. aurantiacum has been reported from only a handful of countries. To identify the worldwide distribution of this species, we screened publicly available sequencing data from fungal metabarcoding studies in the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) by multiple BLAST searches. S. aurantiacum was found in 26 countries and two islands, throughout every climatic region. This distribution is similar to that of other Scedosporium species. Several new environmental sources of S. aurantiacum were identified, including human and bovine milk, chicken and canine gut, freshwater, and feces of the giant white-tailed rat (Uromys caudimaculatus). This study demonstrated that raw sequence data stored in the SRA database can be repurposed using a big data analysis approach to answer biological questions of interest.

Lay summary
To understand the distribution and natural habitat of S. aurantiacum, species-specific DNA sequences were searched for in the SRA database. Our large-scale data analysis illustrates that S. aurantiacum is more widely distributed than previously thought, and new environmental sources were identified.


Introduction
Scedosporium is a genus of fungi in the family Microascaceae of the Ascomycota. It comprises saprotrophic mold species that live mainly on decaying organic matter and are found in soil, sewage, and contaminated water. 1 , 2 The genus currently includes ten species ( S. aurantiacum, S. minutisporum, S. desertorum, S. cereisporum, S. dehoogii, S. angustum, S. apiospermum, S. boydii, S. ellipsoideum , and S. fusarium ). Five of them have been found to be clinically relevant: Scedosporium apiospermum, S. boydii, S. aurantiacum, S. dehoogii and S. minutisporum . 3 Scedosporium species can cause localized or severe disseminated infections, depending on the immune status of the host. 4 They are responsible for 25% of non-Aspergillus mold infections in organ transplant recipients in the USA and are associated with the occurrence of major trauma. 5 -7 They have also been reported from patients with pulmonary conditions, such as cystic fibrosis (CF), but their significance in these conditions is uncertain. 8 -10 Among them, S. boydii and S. apiospermum are the most frequently isolated species, but in some regions S. aurantiacum is more common. 11 S. aurantiacum is an opportunistic pathogen capable of causing a wide variety of localized and superficial infections, such as malignant otitis externa, osteomyelitis, invasive sinusitis, keratitis, and pneumonia. 7 , 12 S. aurantiacum, separated from other Scedosporium species by molecular markers, such as β-tubulin, calmodulin, and the internal transcribed spacer (ITS) region, was first proposed as a new species in 2005. 13
Several studies have described the ecology and environmental distribution of different Scedosporium species, mainly in Europe, such as in France, 14 Austria, and The Netherlands, 1 as well as in Australia, 11 Thailand, 15 , 16 Mexico 17 and Morocco. 18 The distribution of Scedosporium spp. showed geographical differences, 1 , 14 with S. aurantiacum being most abundant in Australia 11 and in agricultural areas in the west of France, 14 and with additional reports from The Netherlands, Morocco, Thailand and Mexico. 1 , 15 -18 In Australia, more than 50% of all environmental Scedosporium isolates were S. aurantiacum, which coincides with the relatively high prevalence of scedosporiosis and their presence as colonizers in CF patients in Australia. 12 , 19 In total, clinical isolates of S. aurantiacum have been reported from ten countries: Australia, Austria, France, Germany, Italy, Japan, the Netherlands, South Korea, Spain, and the United States of America. In comparison, environmental isolates of S. aurantiacum have been reported from 14 countries: Australia, Austria, France, Germany, Italy, Latvia, Mexico, Morocco, Nepal, the Netherlands, Russia, Spain, Thailand, and the UK, where they were mainly recovered from soil, compost and sewage water. 1 , 11 , 14 -18 , 20 -23
To expand the knowledge of the environmental distribution of microorganisms, metabarcoding has become the main tool used to characterize complex microbial and other communities, from microbial ecology studies to infectious disease surveillance. 24 -26 DNA metabarcoding is the simultaneous identification of a large set of taxa present in a single complex sample. 27 The approach combines the concept of DNA barcoding 28 with the application of next generation sequencing (NGS). It uses short DNA sequences (barcodes) to standardize the identification of organisms from all kingdoms down to species level by comparison with a reference sequence collection of well-identified species. 28 , 29 Developments in NGS have made it possible to generate and analyze millions of targeted amplicons (barcodes) amplified by polymerase chain reaction (PCR) from thousands of mixed DNA templates within the same sample simultaneously to determine the species composition of the sample. 29 Metabarcoding is currently the standard tool and the most efficient method for culture-independent assessment of microbiomes. 30
In fungi, the ITS region was established as the primary fungal DNA barcode in 2012. 31 This is due to its multicopy nature and its easy amplification with universal primers that are compatible with most fungal species. 31 , 32 It has been used extensively in both molecular systematics and ecological studies of fungi for more than three decades. 2 , 33 , 34 The ITS region consists of the ITS1 and ITS2 regions separated by the 5.8S gene and is located between the 18S (SSU) and 28S (LSU) genes in the nrDNA repeat unit. 33 With traditional Sanger sequencing, the entire ITS region, which ranges between 280 and 800 bp, has been targeted for molecular identification purposes. 35 In metabarcoding studies, however, either the ITS1 or the ITS2 region has been amplified and sequenced by NGS technologies, because the entire ITS region is too long for commonly used sequencing platforms, such as Illumina, Ion Torrent or the phased-out 454 sequencing from Roche. 2 , 36
As molecular identification of various microbial samples has become an essential part of different studies worldwide, it has provided new insights into the diversity and ecology of many different fungal communities (mycobiomes). 37 , 38 As a result, large amounts of partial ITS sequences have been generated by NGS and deposited in public sequence databases, such as the Sequence Read Archive (SRA) of the National Institutes of Health (NIH), which is the primary international public archive of high-throughput sequencing data established under the guidance of the International Nucleotide Sequence Database Collaboration (INSDC). 39 SRA stores raw sequence data from different NGS technologies, including Roche 454, Illumina, Ion Torrent, Pacific Biosciences and Oxford Nanopore Technologies. SRA has the largest, most diverse collection of NGS data from human, nonhuman and microbial sources.
The current study screened the publicly available fungal metabarcoding data in the NIH's SRA database to identify the geographical distribution and the potential ecological sources and reservoirs of the emerging human pathogenic fungus S. aurantiacum . It serves as a pilot study highlighting the potential of repurposing publicly available raw sequence datasets to answer major biological and public health questions.

Methods
All data used in this study are publicly available in the SRA database ( https://www.ncbi.nlm.nih.gov/sra ). A subset of 192 117 SRA datasets containing ITS1 or ITS2 sequences from fungal metabarcoding studies was identified, as of June 2020, by searching the web interface of the SRA database with the keywords 'fungi', 'fungal diversity' and 'ITS region'. The query outputs were combined, and duplicate datasets were removed based on their unique identification number.
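The combine-and-deduplicate step can be sketched as follows. This is a minimal illustration only, not the procedure actually used: the function name and the accession values are hypothetical, and it assumes that each keyword query yields a plain list of unique dataset identifiers.

```python
from typing import Iterable, List


def merge_unique_accessions(result_lists: Iterable[Iterable[str]]) -> List[str]:
    """Combine several keyword-search result lists and drop duplicate IDs.

    The first occurrence of each identifier is kept, in input order.
    """
    seen = set()
    merged = []
    for results in result_lists:
        for accession in results:
            key = accession.strip().upper()  # normalise whitespace and case
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Hypothetical identifiers, one list per keyword query.
fungi_hits = ["SRR0000001", "SRR0000002", "SRR0000003"]
diversity_hits = ["SRR0000002", "SRR0000004"]
its_hits = ["srr0000003 ", "SRR0000005"]

unique_runs = merge_unique_accessions([fungi_hits, diversity_hits, its_hits])
print(len(unique_runs), unique_runs)
```

Keeping the first occurrence preserves the order of the original query outputs, which makes the merged list easier to trace back to the individual keyword searches.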
The SRA toolkit version 2.10.7 40 and the basic local alignment search tool (BLAST) implemented in the toolkit 41 were used to screen and identify the datasets containing S. aurantiacum ITS sequences. The query sequence (661 bp in total) contained the full ITS region (ITS1 + 5.8S + ITS2) and partial SSU and LSU sequences; it was extracted from a contig of the whole-genome assembly of S. aurantiacum strain WM 09.24 (GenBank accession number: JUDQ01000713.1). The sequence identity threshold for the BLAST analysis was 99%, and the E-value cutoff was set to less than 1E-80 to minimize false positive hits. Sequence data from positive matches containing either the ITS1 or ITS2 region of S. aurantiacum were then checked manually.
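The two thresholds can be illustrated with a short filter over BLAST hits. This is a hedged sketch rather than the pipeline used in the study: it assumes that hits were exported in BLAST's standard 12-column tabular format, that the 99% identity threshold is inclusive, and the example hit lines are invented for demonstration.

```python
from typing import Dict, Iterable, Iterator, List

# Column order of BLAST's standard 12-column tabular output (assumed format).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

MIN_IDENTITY = 99.0  # percent identity threshold (assumed inclusive)
MAX_EVALUE = 1e-80   # hits must have an E-value below this cutoff


def parse_hits(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per tab-separated hit line, skipping blanks and comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < len(COLUMNS):
            continue  # malformed line
        hit: Dict[str, object] = dict(zip(COLUMNS, fields))
        hit["pident"] = float(fields[2])
        hit["evalue"] = float(fields[10])
        yield hit


def passes_thresholds(hit: Dict[str, object]) -> bool:
    """Apply the identity and E-value thresholds to a parsed hit."""
    return hit["pident"] >= MIN_IDENTITY and hit["evalue"] < MAX_EVALUE


# Invented example hits: only the first satisfies both thresholds.
example_lines = [
    "query\tSRR0000001.17\t99.60\t250\t1\t0\t1\t250\t1\t250\t2e-125\t455",
    "query\tSRR0000002.5\t97.20\t250\t7\t0\t1\t250\t1\t250\t1e-110\t400",
    "query\tSRR0000003.9\t100.00\t60\t0\t0\t1\t60\t1\t60\t3e-25\t111",
]

kept: List[Dict[str, object]] = [h for h in parse_hits(example_lines)
                                 if passes_thresholds(h)]
print([h["sseqid"] for h in kept])
```

The third example shows why both criteria matter: a short read can match perfectly and still fail the E-value cutoff. Hits that pass such a filter would still need the manual check described above.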
All available metadata associated with the S. aurantiacum positive SRA datasets (Supplementary Table 1), including information about their geographical locations and isolation sources, were downloaded from the SRA database. Where the metadata in the SRA database were incomplete, the publications associated with the SRA data were screened to extract the missing information.
PubMed, Scopus, Web of Science, and Google Scholar were screened, as of 31 July 2020, for published data on the occurrence and ecological distribution of S. aurantiacum in clinical and environmental samples, using the keyword S. aurantiacum . In addition, the Nucleotide database of NCBI, the Westmead Mycology Culture Collection, and the culture collection of fungi and yeasts of the Westerdijk Fungal Biodiversity Institute were screened for additional clinical and environmental isolates of S. aurantiacum.
Individual geographical locations obtained from the S. aurantiacum positive SRA datasets, together with the published unique locations of clinical and environmental occurrence of S. aurantiacum, were plotted on a world map using the QGIS geographic information system (version 3.10.9-A Coruña with GRASS 7.8.3). 42

Results
The database search identified 1706 SRA datasets that contained either the ITS1 or ITS2 region of S. aurantiacum (Supplementary Table 1). The locations extracted from the associated metadata were plotted, together with the published unique locations of clinical and environmental occurrence of S. aurantiacum (Table 1 ), on a world map using the QGIS software ( Figure 1 ). The results indicate that S. aurantiacum has a wide geographic distribution ( Figure 1 ). In total, S. aurantiacum was identified in 26 countries and two islands (Reunion and Christmas Island) ( Table 1 ). Among them, it had not been reported before from Afghanistan, Belgium, Brazil, Canada, China, Christmas Island, Costa Rica, Czech Republic, El Salvador, Finland, Israel, New Zealand, Philippines, Portugal, Reunion, Singapore, Switzerland, and United Kingdom. The highest numbers of S. aurantiacum positive SRA datasets were from China (965), followed by the United Kingdom (241) and Australia (135). The environmental sources of the S. aurantiacum positive SRA datasets were mainly various soil, sludge, and sediment samples (88% of the samples) ( Table 2 ). This study also identified several new sources from which S. aurantiacum had not yet been reported, such as human and bovine milk, chicken and canine gut, freshwater, and feces of the giant white-tailed rat ( Uromys caudimaculatus ) ( Table 2 ).
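A per-country tally of this kind can be sketched as below. The metadata rows, the `country` field name and the helper functions are hypothetical and only illustrate the bookkeeping; the three counts and the total of 1706 are those reported above.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_country(records: Iterable[Dict[str, str]]) -> Counter:
    """Count positive datasets per country; empty values are pooled."""
    counts: Counter = Counter()
    for record in records:
        country = (record.get("country") or "").strip()
        counts[country or "not reported"] += 1
    return counts


def percent_share(count: int, total: int) -> float:
    """Share of `count` in `total`, as a percentage rounded to one decimal."""
    return round(100.0 * count / total, 1)


# Hypothetical metadata rows (the field name 'country' is an assumption).
example_records = [
    {"run": "SRR0000001", "country": "China"},
    {"run": "SRR0000002", "country": "China "},
    {"run": "SRR0000003", "country": "Australia"},
    {"run": "SRR0000004", "country": ""},
]
example_counts = count_by_country(example_records)
print(example_counts.most_common())

# Counts reported in the text, relative to all 1706 positive datasets.
TOTAL_POSITIVE = 1706
reported = {"China": 965, "United Kingdom": 241, "Australia": 135}
shares = {c: percent_share(n, TOTAL_POSITIVE) for c, n in reported.items()}
for country, share in shares.items():
    print(f"{country}: {share}% of positive datasets")
```

Together, these three countries account for 1341 of the 1706 positive datasets. Pooling empty values under a separate label keeps datasets with missing location metadata visible in the tally instead of silently dropping them.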

Discussion
So far, S. aurantiacum has been reported from only a few countries, and few studies have assessed its global distribution. Environmental isolates of S. aurantiacum have previously been reported only from Australia, 11 France, 3 , 14 The Netherlands, 1 Morocco, 18 Thailand 15 , 16 and Mexico. 17 Clinical reports of S. aurantiacum have previously not demonstrated any association with environmental isolates of the same species. Until now, both clinical and environmental isolates have been reported only from Australia, 11 , 12 Austria, 1 France, 3 , 14 and The Netherlands. 1 , 43 Clinical cases of S. aurantiacum have been reported from Japan, 44 South Korea, 45 and Spain, 46 while environmental isolates have been reported from Italy, 20 Mexico, 17 Morocco 18 and Thailand. 15 , 16
The present study searched the publicly available raw sequence data of the NCBI SRA database to assess the geographical distribution and the environmental niches and reservoirs of the emerging fungal pathogen S. aurantiacum . It identified the occurrence of S. aurantiacum in 16 additional countries and two islands from which it had not been reported previously ( Table 1 ). The highest numbers of locations were found in datasets from China, the United Kingdom and Australia (Table 1 ). However, it is important to note that these high numbers are very likely due to the large number of metabarcoding studies carried out in these countries. As metabarcoding studies are still relatively expensive ( ∼$100 US per sample), they remain infeasible in many countries. The results suggest that S. aurantiacum has a wide distribution rather than being limited to certain countries. One reason why S. aurantiacum has not been reported more often could be misidentification: the species cannot be morphologically distinguished from the closely related species S. apiospermum and was only recently described on the basis of sequence analysis of a number of genetic loci. 13 It can therefore be assumed that many routine clinical laboratories, in which molecular identification methods are unavailable or too expensive, will misidentify this species. Another reason could be that many countries have not reported S. aurantiacum infections in scientific papers despite correctly identifying them. For example, a recent study on the identification and susceptibility of clinically relevant Scedosporium spp. in China did not report any S. aurantiacum isolates, 47 which is in sharp contrast with the metabarcoding-based findings obtained here.
The screening of the SRA database also showed no clear relationship between the distribution of S. aurantiacum and climate: S. aurantiacum specific sequences were found in metabarcoding datasets from samples collected in temperate, arid, and tropical zones, as well as in Mediterranean and tundra regions.
The environmental sources of S. aurantiacum as identified in the current study remain predominantly various soils, sewage and sediments as has been reported previously. 1 , 3 , 16 -18 The current study also identified additional sources, such as human and bovine milk, chicken and canine gut, freshwater, and feces of the giant white-tailed rat ( Uromys caudimaculatus ).
Although this study shows that S. aurantiacum has a wide distribution, it should be viewed in the light of its biases and limitations. A detailed discussion of these biases is beyond the scope of this paper. However, a non-exhaustive list includes statistical sampling error, sequencing error, and the BLAST algorithm itself. 48 -50 On the technological side of metabarcoding, there are many well-documented technical artifacts, including DNA extraction and amplification as well as PCR biases, which can result in the non-detection of certain species even if they are present in the samples. 51 -56 Other potential sources of bias and error are the bioinformatic tools used, e.g., the BLAST algorithm and the SRA database search function. Although BLAST is the most widely used alignment-based sequence similarity search algorithm, 57 it has major disadvantages: it is generally memory- and time-consuming, which limits its use for large-scale sequence data. Selecting a relevant subset of data from the complete SRA database ( ∼18 petabytes) is also challenging. Although many scientific journals require raw sequence data to be submitted to the SRA database prior to publication, there are few standards for how much associated metadata should be submitted with the raw sequence data. In a number of cases, this has resulted in insufficient or incomplete metadata associated with the raw sequence data, which makes the subsequent filtering process challenging and incomplete. Sometimes it is not even stated whether the dataset contains fungal ITS sequences. In other cases, the metadata is available only in the publication and not in the SRA database.
Overall, the current study identified 192 117 publicly available datasets containing either ITS1 or ITS2 sequences. With a rough estimate of about $100 US sequencing cost per sample, the study screened ∼$19.21 million US worth of sequence data from many countries to assess the global ecological distribution of an emerging opportunistic fungal pathogen. This study of the emerging human pathogen S. aurantiacum greatly expanded our knowledge of its natural reservoirs and of their potential to be a source of human infection. The wider environmental presence of this human pathogen described here should alert public health authorities to these potential infection sources when assessing the risk for vulnerable individuals. The study highlights the potential application of the SRA database to search for the geographical and environmental distribution of fungal species, or in fact any microorganism, to answer questions about disease reservoirs, potentially enabling the prediction of outbreaks and increasing the preparedness of public health authorities. It should be viewed as a pilot study using the vast hidden treasure of the SRA database to answer certain biological questions.
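The cost estimate above follows from a single multiplication, reproduced here as a short check. It rests on the same rough assumptions as the text: each dataset is treated as one sample, at a flat sequencing cost of about $100 US. The variable names are illustrative only.

```python
# Figures taken from the text; treating one dataset as one sample is a
# simplifying assumption of the estimate.
DATASETS_SCREENED = 192_117
COST_PER_SAMPLE_USD = 100

total_usd = DATASETS_SCREENED * COST_PER_SAMPLE_USD
total_millions = f"{total_usd / 1e6:.2f}"

print(f"{total_usd:,} USD (about {total_millions} million USD)")
```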

Supplementary material
Supplementary material is available at MMYCOL online.